{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 编写结构化程序"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.1 回到基础"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、赋值：\n",
    "列表赋值是“引用”，改变其中一个，其他都会改变\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1', '3']\n"
     ]
    }
   ],
   "source": [
    "foo = [\"1\", \"2\"]\n",
    "bar = foo\n",
    "foo[1] = \"3\"\n",
    "print(bar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[], [], []]\n",
      "[['3'], ['3'], ['3']]\n"
     ]
    }
   ],
   "source": [
    "empty = []\n",
    "nested = [empty, empty, empty]\n",
    "print(nested)\n",
    "nested[1].append(\"3\")\n",
    "print(nested)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['3'], ['3'], ['3']]\n",
      "[['3'], ['2'], ['3']]\n"
     ]
    }
   ],
   "source": [
    "nes = [[]] * 3\n",
    "nes[1].append(\"3\")\n",
    "print(nes)\n",
    "nes[1] = [\"2\"]                # 这里最新赋值时，不会传递给其他元素\n",
    "print(nes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['3'], ['3'], ['3']]\n",
      "[['3'], ['3'], ['new']]\n",
      "[['3'], ['3'], ['3']]\n"
     ]
    }
   ],
   "source": [
    "new = nested[:]\n",
    "print(new)\n",
    "new[2] = [\"new\"]\n",
    "print(new)\n",
    "print(nested)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['3'], ['3'], ['3']]\n",
      "[['3'], ['3'], ['new2']]\n",
      "[['3'], ['3'], ['3']]\n"
     ]
    }
   ],
   "source": [
    "import copy\n",
    "new2 = copy.deepcopy(nested)\n",
    "print(new2)\n",
    "new2[2] = [\"new2\"]\n",
    "print(new2)\n",
    "print(nested)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、等式\n",
    "Python提供两种方法来检查一对项目是否相同。\n",
    "\n",
    "is    操作符测试对象的ID。\n",
    "\n",
    "==  检测对象是否相等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['Python'], ['Python'], ['Python'], ['Python'], ['Python']]\n",
      "True\n",
      "False\n"
     ]
    }
   ],
   "source": [
    "snake_nest = [[\"Python\"]] * 5\n",
    "snake_nest[2] = ['Python']\n",
    "\n",
    "print(snake_nest)\n",
    "\n",
    "print(snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4])\n",
    "\n",
    "print(snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、elif 和 else\n",
    "if elif  表示 if 为假而且elif 后边为真，表示如果第一个if执行，则不在执行elif\n",
    "\n",
    "if if     表示无论第一个if是否执行，第二个if都会执行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "animals = [\"cat\", \"dog\"]\n",
    "if \"cat\" in animals:\n",
    "    print(1)\n",
    "if \"dog\" in animals:\n",
    "    print(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "animals = [\"cat\", \"dog\"]\n",
    "if \"cat\" in animals:\n",
    "    print(1)\n",
    "elif \"dog\" in animals:\n",
    "    print(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.2 序列\n",
    "\n",
    "字符串，列表，元组\n",
    "\n",
    "## 1、元组 \n",
    "由逗号隔开，通常使用括号括起来，可以被索引和切片，并且由长度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('walk', 'fem', 3)\n",
      "walk\n",
      "('fem', 3)\n",
      "3\n"
     ]
    }
   ],
   "source": [
    "t = \"walk\", \"fem\", 3\n",
    "print(t)\n",
    "print(t[0])\n",
    "print(t[1:])\n",
    "print(len(t))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、序列可以直接相互赋值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'spectroroute', 'off', 'the', 'turned']\n"
     ]
    }
   ],
   "source": [
    "words = [\"I\", \"turned\", \"off\", \"the\", \"spectroroute\"]\n",
    "words[1], words[4] = words[4], words[1]\n",
    "print(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、处理序列的函数\n",
    "sorted（）函数、reversed（）函数、zip（）函数、enumerate（）函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      " ['I', 'spectroroute', 'off', 'the', 'turned']\n",
      "['I', 'off', 'spectroroute', 'the', 'turned']\n",
      "\n",
      " ['I', 'spectroroute', 'off', 'the', 'turned']\n",
      "<list_reverseiterator object at 0x7f446c0544a8>\n",
      "['turned', 'the', 'off', 'spectroroute', 'I']\n",
      "\n",
      " ['I', 'spectroroute', 'off', 'the', 'turned']\n",
      "<zip object at 0x7f446c16e0c8>\n",
      "[('I', 0), ('spectroroute', 1), ('off', 2), ('the', 3), ('turned', 4)]\n",
      "\n",
      " ['I', 'spectroroute', 'off', 'the', 'turned']\n",
      "<enumerate object at 0x7f446c1683a8>\n",
      "[(0, 'I'), (1, 'spectroroute'), (2, 'off'), (3, 'the'), (4, 'turned')]\n"
     ]
    }
   ],
   "source": [
    "print(\"\\n\",words)\n",
    "print(sorted(words))\n",
    "\n",
    "print(\"\\n\",words)\n",
    "print(reversed(words))\n",
    "print(list(reversed(words)))\n",
    "\n",
    "print(\"\\n\",words)\n",
    "print(zip(words, range(len(words))))\n",
    "print(list(zip(words, range(len(words)))))\n",
    "\n",
    "print(\"\\n\",words)\n",
    "print(enumerate(words))\n",
    "print(list(enumerate(words)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4、合并不同类型的序列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'turned', 'off', 'the', 'spectroroute']\n",
      "[(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')]\n",
      "[(1, 'I'), (3, 'off'), (3, 'the'), (6, 'turned'), (12, 'spectroroute')]\n",
      "I off the turned spectroroute\n"
     ]
    }
   ],
   "source": [
    "words = \"I turned off the spectroroute\".split()\n",
    "print (words)\n",
    "\n",
    "wordlens = [(len(word), word) for word in words]\n",
    "print(wordlens)\n",
    "\n",
    "wordlens.sort()\n",
    "print (wordlens)\n",
    "\n",
    "print(\" \".join(w for (_, w) in wordlens))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5、生成器表达式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1、列表推导\n",
    "我们一直在大量使用列表推导，因为用它处理文本结构紧凑和可读性好。下面是一个例子，分词和规范化一个文本：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['``', 'when', 'i', 'use', 'a', 'word', ',', \"''\", 'humpty', 'dumpty', 'said', 'in', 'rather', 'a', 'scornful', 'tone', ',', '...', '``', 'it', 'means', 'just', 'what', 'i', 'choose', 'it', 'to', 'mean', '-', 'neither', 'more', 'nor', 'less', '.', \"''\"]\n"
     ]
    }
   ],
   "source": [
    "from nltk import *\n",
    "text = '''\"When I use a word,\" Humpty Dumpty said in rather a scornful tone,\n",
    "... \"it means just what I choose it to mean - neither more nor less.\"'''\n",
    "print([w.lower() for w in word_tokenize(text)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2、生成器表达式\n",
    "第二行使用了生成器表达式。这不仅仅是标记方便：在许多语言处理的案例中，生成器表达式会更高效。\n",
    "\n",
    "在[1]中，列表对象的存储空间必须在max()的值被计算之前分配。如果文本非常大的，这将会很慢。\n",
    "\n",
    "在[2]中，数据流向调用它的函数。由于调用的函数只是简单的要找最大值——按字典顺序排在最后的词——它可以处理数据流，而无需存储迄今为止的最大值以外的任何值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word\n",
      "word\n"
     ]
    }
   ],
   "source": [
    "print(max([w.lower() for w in word_tokenize(text)]))\n",
    "print (max(w.lower() for w in word_tokenize(text)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.3 风格的问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、过程风格与声明风格\n",
    "计算布朗语料库中词的平均长度的程序:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4.401545438271973"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 过程风格\n",
    "import nltk\n",
    "tokens = nltk.corpus.brown.words(categories='news')\n",
    "count = 0\n",
    "total = 0\n",
    "for token in tokens:\n",
    "    count += 1\n",
    "    total += len(token)\n",
    "total / count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.401545438271973\n"
     ]
    }
   ],
   "source": [
    "# 声明风格\n",
    "total = sum(len(w) for w in tokens)\n",
    "print(total / len(tokens))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "unextinguishable\n"
     ]
    }
   ],
   "source": [
    "# 过程风格\n",
    "text = nltk.corpus.gutenberg.words('milton-paradise.txt')\n",
    "longest = ''\n",
    "for word in text:\n",
    "    if len(word) > len(longest):\n",
    "        longest = word\n",
    "print(longest)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']\n"
     ]
    }
   ],
   "source": [
    "# 声明风格：使用两个列表推到\n",
    "maxlen = max(len(word) for word in text)\n",
    "print([word for word in text if len(word) == maxlen])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1       5.40%        the\n",
      "  2       5.02%          ,\n",
      "  3       4.25%          .\n",
      "  4       3.11%         of\n",
      "  5       2.40%        and\n",
      "  6       2.22%         to\n",
      "  7       1.88%          a\n",
      "  8       1.68%         in\n"
     ]
    }
   ],
   "source": [
    "# enumerate() 枚举频率分布的值\n",
    "fd = nltk.FreqDist(nltk.corpus.brown.words())\n",
    "cumulative = 0.0\n",
    "most_common_words = [word for (word, count) in fd.most_common()]\n",
    "for rank, word in enumerate(most_common_words):\n",
    "    cumulative += fd.freq(word)\n",
    "    print(\"%3d %10.2f%% %10s\" % (rank + 1, fd.freq(word) * 100, word))\n",
    "    if cumulative > 0.25:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、计数器的一些合理用途"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['The', 'dog', 'gave'],\n",
       " ['dog', 'gave', 'John'],\n",
       " ['gave', 'John', 'the'],\n",
       " ['John', 'the', 'newspaper']]"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']\n",
    "n = 3\n",
    "[sent[i:i+n] for i in range(len(sent) - n +1)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、构建多维数组：使用嵌套列表推导和使用对象复制（[ ] * n）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[set(), set(), set(), set(), set(), set(), set()],\n",
      " [set(), set(), set(), set(), set(), set(), set()],\n",
      " [set(), set(), set(), set(), set(), {'Alice'}, set()]]\n"
     ]
    }
   ],
   "source": [
    "# 使用嵌套列表推导可以修改内容\n",
    "m, n = 3, 7\n",
    "array = [[set() for i in range(n)] for j in range(m)]\n",
    "array[2][5].add(\"Alice\")\n",
    "import pprint\n",
    "pprint.pprint(array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[{7}, {7}, {7}, {7}, {7}, {7}, {7}],\n",
      " [{7}, {7}, {7}, {7}, {7}, {7}, {7}],\n",
      " [{7}, {7}, {7}, {7}, {7}, {7}, {7}]]\n"
     ]
    }
   ],
   "source": [
    "# 使用对象复制，修改一个其他的也会改变\n",
    "array = [[set()] * n] *m\n",
    "array[2][5].add(7)\n",
    "pprint.pprint(array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.4 函数：结构化编程的基础"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 【Python自然语言处理】读书笔记：第四章：编写结构化程序 /*! * * Twitter Bootstrap * */ /*! * Bootstrap v3.3.7 (http://getbootstrap.com) * Copyright 2011-2016 Twitter, Inc. * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) */ /*! normalize.css v3.0.3 | MIT License | github.com/necolas/normalize.cs\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "def get_text(file):\n",
    "    \"\"\"Read text from a file, normalizing whitespace and stipping HTML markup\"\"\"\n",
    "    text = open(file).read()\n",
    "    text = re.sub(r\"<.*?>\", \" \", text)\n",
    "    text = re.sub(r\"\\s+\", \" \", text)\n",
    "    return text\n",
    "\n",
    "contents = get_text(\"4.test.html\")\n",
    "print(contents[:300])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function get_text in module __main__:\n",
      "\n",
      "get_text(file)\n",
      "    Read text from a file, normalizing whitespace and stipping HTML markup\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(get_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1、考虑以下三个排序函数。第三个是危险的，因为程序员可能没有意识到它已经修改了给它的输入。一般情况下，函数应该修改参数的内容（my_sort1()）或返回一个值（my_sort2()），而不是两个都做（my_sort3()）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 2, 3] \n",
      "\n",
      "my_sort2(mylist): [1, 2, 3]\n",
      "[3, 2, 1] \n",
      "\n",
      "my_sort2(mylist): [1, 2, 3]\n",
      "[1, 2, 3] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "def my_sort1(mylist):      # good: modifies its argument, no return value\n",
    "    mylist.sort()\n",
    "def my_sort2(mylist):      # good: doesn't touch its argument, returns value\n",
    "    return sorted(mylist)\n",
    "def my_sort3(mylist):      # bad: modifies its argument and also returns it\n",
    "    mylist.sort()\n",
    "    return mylist\n",
    "\n",
    "mylist = [3,2,1]\n",
    "my_sort1(mylist)\n",
    "print (mylist,\"\\n\")\n",
    "\n",
    "mylist = [3,2,1]\n",
    "print(\"my_sort2(mylist):\",my_sort2(mylist))\n",
    "print (mylist,\"\\n\")\n",
    "\n",
    "\n",
    "mylist = [3,2,1]\n",
    "print(\"my_sort2(mylist):\",my_sort3(mylist))\n",
    "print (mylist,\"\\n\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、参数传递\n",
    "早在4.1节中，你就已经看到了赋值操作，而一个结构化对象的值是该对象的引用。函数也是一样的。要理解Python按值传递参数，只要了解它是如何赋值的就足够了。【意思就是在函数中赋值的时候如果是结构化对象，那么赋值仅仅是引用，如果是字符串、数值等非结构化的变量，则在函数中改变，仅仅是局部变量改变】"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "['noun']\n"
     ]
    }
   ],
   "source": [
    "def set_up(word, properties):\n",
    "    word = 'lolcat'\n",
    "    properties.append('noun')\n",
    "    properties = 5\n",
    "\n",
    "w = ''\n",
    "p = []\n",
    "set_up(w, p)\n",
    "print(w)\n",
    "print(p)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请注意，w没有被函数改变。当我们调用set_up(w, p)时，w（空字符串）的值被分配到一个新的变量word。在函数内部word值被修改。然而，这种变化并没有传播给w。这个参数传递过程与下面的赋值序列是一样的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "w = ''\n",
    "word = w\n",
    "word = 'lolcat'\n",
    "print(w)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们来看看列表p上发生了什么。当我们调用set_up(w, p)，p的值（一个空列表的引用）被分配到一个新的本地变量properties，所以现在这两个变量引用相同的内存位置。函数修改properties，而这种变化也反映在p值上，正如我们所看到的。函数也分配给properties一个新的值（数字5）；这并不能修改该内存位置上的内容，而是创建了一个新的局部变量。这种行为就好像是我们做了下列赋值序列：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['noun']\n"
     ]
    }
   ],
   "source": [
    "p = []\n",
    "properties = p\n",
    "properties.append('noun')\n",
    "properties = 5\n",
    "print(p)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、变量的作用域"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当你在一个函数体内部使用一个现有的名字时，Python解释器先尝试按照函数本地的名字来解释。如果没有发现，解释器检查它是否是一个模块内的全局名称。最后，如果没有成功，解释器会检查是否是Python内置的名字。这就是所谓的名称解析的LGB规则：本地（local），全局（global），然后内置（built-in）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4、参数类型检查\n",
    "我们写程序时，Python不会强迫我们声明变量的类型，这允许我们定义参数类型灵活的函数。例如，我们可能希望一个标注只是一个词序列，而不管这个序列被表示为一个列表、元组（或是迭代器，一种新的序列类型，超出了当前的讨论范围）。\n",
    "\n",
    "然而，我们常常想写一些能被他人利用的程序，并希望以一种防守的风格，当函数没有被正确调用时提供有益的警告。下面的tag()函数的作者假设其参数将始终是一个字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "det\n",
      "noun\n",
      "noun\n"
     ]
    }
   ],
   "source": [
    "def tag(word):\n",
    "    if word in ['a', 'the', 'all']:\n",
    "        return 'det'\n",
    "    else:\n",
    "        return 'noun'\n",
    "\n",
    "print(tag('the'))\n",
    "\n",
    "print(tag('knight'))\n",
    "\n",
    "print(tag([\"'Tis\", 'but', 'a', 'scratch']))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该函数对参数'the'和'knight'返回合理的值，传递给它一个列表[1]，看看会发生什么——它没有抱怨，虽然它返回的结果显然是不正确的。此函数的作者可以采取一些额外的步骤来确保tag()函数的参数word是一个字符串。一种直白的做法是使用if not type(word) is str检查参数的类型，如果word不是一个字符串，简单地返回Python特殊的空值None。这是一个略微的改善，因为该函数在检查参数类型，并试图对错误的输入返回一个“特殊的”诊断结果。然而，它也是危险的，因为调用程序可能不会检测None是故意设定的“特殊”值，这种诊断的返回值可能被传播到程序的其他部分产生不可预测的后果。如果这个词是一个Unicode字符串这种方法也会失败。因为它的类型是unicode而不是str。这里有一个更好的解决方案，使用assert语句和Python的basestring的类型一起，它是unicode和str的共同类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "ename": "AssertionError",
     "evalue": "argument to tag() must be a string",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAssertionError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-22-9bf6c76ff64f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      6\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0;34m'noun'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      7\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtag\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"'Tis\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'but'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'a'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'scratch'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-22-9bf6c76ff64f>\u001b[0m in \u001b[0;36mtag\u001b[0;34m(word)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mtag\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m     \u001b[0;32massert\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"argument to tag() must be a string\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mword\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'a'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'the'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'all'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0;34m'det'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mAssertionError\u001b[0m: argument to tag() must be a string"
     ]
    }
   ],
   "source": [
    "def tag(word):\n",
    "    assert isinstance(word, str), \"argument to tag() must be a string\"\n",
    "    if word in ['a', 'the', 'all']:\n",
    "        return 'det'\n",
    "    else:\n",
    "        return 'noun'\n",
    "\n",
    "print(tag([\"'Tis\", 'but', 'a', 'scratch']))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果assert语句失败，它会产生一个不可忽视的错误而停止程序执行。此外，该错误信息是容易理解的。程序中添加断言能帮助你找到逻辑错误，是一种防御性编程。一个更根本的方法是在本节后面描述的使用文档字符串为每个函数记录参数的文档。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5、功能分解\n",
    "当我们使用函数时,主程序可以在一个更高的抽象水平编写,使其结构更透明,例如:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.401545438271973\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "tokens = nltk.corpus.brown.words(categories='news')\n",
    "total = sum(len(w) for w in tokens)\n",
    "print(total / len(tokens))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考例 4-2 中 freq_words 函数。它更新一个作为参数传递进来的频率分布的内容,并输出前 n 个最频繁的词的链表。\n",
    "\n",
    "例 4-2. 设计不佳的函数用来计算高频词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n",
      "[\"''\", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', \"'\", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents']\n",
      "\n",
      " [\"''\", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', \"'\", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents', 'founding', '#', 'to', 'declaration', 'constitution', 'a', 'visit', 'online', 'freedom', 'for']\n"
     ]
    }
   ],
   "source": [
    "from nltk import *\n",
    "from urllib import request\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def freq_words(url, freqdist, n):\n",
    "    html = request.urlopen(url).read().decode('utf8')\n",
    "    raw = BeautifulSoup(html).get_text()\n",
    "    for word in word_tokenize(raw):\n",
    "        freqdist[word.lower()] += 1\n",
    "    result = []\n",
    "    for word, count in freqdist.most_common(n):\n",
    "        result = result + [word]\n",
    "    print(result)\n",
    "constitution = \"http://www.archives.gov/national-archives-experience\" \\\n",
    "\"/charters/constitution_transcript.html\"\n",
    "fd = nltk.FreqDist()\n",
    "print([w for (w, _) in fd.most_common(20)])\n",
    "freq_words(constitution, fd, 20)\n",
    "print(\"\\n\",[w for (w, _) in fd.most_common(30)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个函数有几个问题。该函数有两个副作用：它修改了第二个参数的内容，并输出它已计算的结果的经过选择的子集。如果我们在函数内部初始化FreqDist()对象（在它被处理的同一个地方），并且去掉选择集而将结果显示给调用程序的话，函数会更容易理解和更容易在其他地方重用。考虑到它的任务是找出频繁的一个词，它应该只应该返回一个列表，而不是整个频率分布。在4.4中，我们重构此函数，并通过去掉freqdist参数简化其接口。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"''\", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', \"'\", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents']\n"
     ]
    }
   ],
   "source": [
    "from urllib import request\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def freq_words(url, n):\n",
    "    html = request.urlopen(url).read().decode('utf8')\n",
    "    text = BeautifulSoup(html).get_text()\n",
    "    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))\n",
    "    return [word for (word, _) in freqdist.most_common(n)]\n",
    "\n",
    "constitution = \"http://www.archives.gov/national-archives-experience\" \\\n",
    "\"/charters/constitution_transcript.html\"\n",
    "print(freq_words(constitution, 20))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6、编写函数的文档\n",
    "在函数的定义顶部的文档字符串中提供这些描述。这个说明不应该解释函数是如何实现的；实际上，应该能够不改变这个说明，使用不同的方法，重新实现这个函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "def accuracy(reference, test):\n",
    "    \"\"\"\n",
    "    Calculate the fraction of test items that equal the corresponding reference items.\n",
    "\n",
    "    Given a list of reference values and a corresponding list of test values,\n",
    "    return the fraction of corresponding values that are equal.\n",
    "    In particular, return the fraction of indexes\n",
    "    {0<i<=len(test)} such that C{test[i] == reference[i]}.\n",
    "\n",
    "        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])\n",
    "        0.5\n",
    "\n",
    "    :param reference: An ordered list of reference values\n",
    "    :type reference: list\n",
    "    :param test: A list of values to compare against the corresponding\n",
    "        reference values\n",
    "    :type test: list\n",
    "    :return: the accuracy score\n",
    "    :rtype: float\n",
    "    :raises ValueError: If reference and length do not have the same length\n",
    "    \"\"\"\n",
    "\n",
    "    if len(reference) != len(test):\n",
    "        raise ValueError(\"Lists must have the same length.\")\n",
    "    num_correct = 0\n",
    "    for x, y in zip(reference, test):\n",
    "        if x == y:\n",
    "            num_correct += 1\n",
    "    return float(num_correct) / len(reference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.5 更多关于函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、作为参数的函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "到目前为止，我们传递给函数的参数一直都是简单的对象，如字符串或列表等结构化对象。Python也允许我们传递一个函数作为另一个函数的参数。现在，我们可以抽象出操作，对相同数据进行不同操作。正如下面的例子表示的，我们可以传递内置函数len()或用户定义的函数last_letter()作为另一个函数的参数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]\n",
      "['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']\n"
     ]
    }
   ],
   "source": [
    "sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',\n",
    "        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']\n",
    "def extract_property(prop):\n",
    "    return [prop(word) for word in sent]\n",
    "\n",
    "print(extract_property(len))\n",
    "\n",
    "def last_letter(word):\n",
    "    return word[-1]\n",
    "\n",
    "print(extract_property(last_letter))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对象len和last_letter可以像列表和字典那样被传递。请注意，只有在我们调用该函数时，才在函数名后使用括号；当我们只是将函数作为一个对象，括号被省略。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python提供了更多的方式来定义函数作为其他函数的参数，即所谓的lambda 表达式。试想在很多地方没有必要使用上述的last_letter()函数，因此没有必要给它一个名字。我们可以等价地写以下内容："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_property(lambda w: w[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们的下一个例子演示传递一个函数给sorted()函数。当我们用唯一的参数（需要排序的链表）调用后者，它使用内置的比较函数cmp()。然而，我们可以提供自己的排序函数，例如按长度递减排序。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds', 'take', 'the', 'the', 'themselves', 'will']\n",
      "[',', '.', 'and', 'Take', 'care', 'the', 'sense', 'the', 'take', 'care', 'of', 'of', 'will', 'sounds', 'themselves']\n"
     ]
    }
   ],
   "source": [
    "print(sorted(sent))\n",
    "\n",
    "print(sorted(sent, key = lambda x:x[-1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、累计函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这些函数以初始化一些存储开始，迭代和处理输入的数据，最后返回一些最终的对象（一个大的结构或汇总的结果）。做到这一点的一个标准的方式是初始化一个空链表，累计材料，然后返回这个链表，如4.6中所示函数search1()。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fizzled\n",
      "Grizzlies'\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "def search1(substring, words):\n",
    "    result = []\n",
    "    for word in words:\n",
    "        if substring in word:\n",
    "            result.append(word)\n",
    "    return result\n",
    "\n",
    "def search2(substring, words):\n",
    "    for word in words:\n",
    "        if substring in word:\n",
    "            yield word\n",
    "            \n",
    "for item in search1('fizzled', nltk.corpus.brown.words()):\n",
    "    print (item)\n",
    "\n",
    "for item in search2('Grizzlies', nltk.corpus.brown.words()):\n",
    "    print (item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "函数search2()是一个生成器。第一次调用此函数，它运行到yield语句然后停下来。调用程序获得第一个词，完成任何必要的处理。一旦调用程序对另一个词做好准备，函数会从停下来的地方继续执行，直到再次遇到yield语句。这种方法通常更有效，因为函数只产生调用程序需要的数据，并不需要分配额外的内存来存储输出（参见前面关于生成器表达式的讨论）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面是一个更复杂的生成器的例子，产生一个词列表的所有排列。为了强制permutations()函数产生所有它的输出，我们将它包装在list()调用中[1]。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['police', 'fish', 'buffalo'],\n",
       " ['fish', 'police', 'buffalo'],\n",
       " ['fish', 'buffalo', 'police'],\n",
       " ['police', 'buffalo', 'fish'],\n",
       " ['buffalo', 'police', 'fish'],\n",
       " ['buffalo', 'fish', 'police']]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def permutations(seq):\n",
    "    if len(seq) <= 1:\n",
    "        yield seq\n",
    "    else:\n",
    "        for perm in permutations(seq[1:]):\n",
    "            for i in range(len(perm)+1):\n",
    "                yield perm[:i] + seq[0:1] + perm[i:]\n",
    "\n",
    "list(permutations(['police', 'fish', 'buffalo']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "permutations函数使用了一种技术叫递归，将在下面4.7讨论。产生一组词的排列对于创建测试一个语法的数据十分有用（8.)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、高阶函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### filter()\n",
    "我们使用函数作为filter()的第一个参数，它对作为它的第二个参数的序列中的每个项目运用该函数，只保留该函数返回True的项目。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def is_content_word(word):\n",
    "    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']\n",
    "sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',\n",
    "        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']\n",
    "list(filter(is_content_word, sent))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### map()\n",
    "将一个函数运用到一个序列中的每一项。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21.75081116158339\n",
      "21.75081116158339\n"
     ]
    }
   ],
   "source": [
    "lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))\n",
    "print(sum(lengths) / len(lengths))\n",
    "\n",
    "\n",
    "lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]\n",
    "print(sum(lengths) / len(lengths))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### lambda 表达式\n",
    "这里是两个等效的例子，计数每个词中的元音的数量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(lambda w: len(list(filter(lambda c: c.lower() in \"aeiou\", w))), sent))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[len([c for c in w if c.lower() in \"aeiou\"]) for w in sent]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "列表推导为基础的解决方案通常比基于高阶函数的解决方案可读性更好，我们在整个这本书的青睐于使用前者。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4、命名的参数\n",
    "当有很多参数时，很容易混淆正确的顺序。我们可以通过名字引用参数，甚至可以给它们分配默认值以供调用程序没有提供该参数时使用。现在参数可以按任意顺序指定，也可以省略。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<empty><empty><empty>\n",
      "Alice\n",
      "AliceAliceAliceAliceAlice\n"
     ]
    }
   ],
   "source": [
    "def repeat(msg='<empty>', num=1):\n",
    "    return msg * num\n",
    "print(repeat(num=3))\n",
    "\n",
    "print(repeat(msg='Alice'))\n",
    "\n",
    "print(repeat(num=5, msg='Alice'))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### *args作为函数参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['four', 'calling', 'birds']\n",
      "[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]\n",
      "[('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]\n"
     ]
    }
   ],
   "source": [
    "song = [['four', 'calling', 'birds'],\n",
    "        ['three', 'French', 'hens'],\n",
    "        ['two', 'turtle', 'doves']]\n",
    "print(song[0])\n",
    "print(list(zip(song[0], song[1], song[2])))\n",
    "\n",
    "print(list(zip(*song)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "应该从这个例子中明白输入*song仅仅是一个方便的记号，相当于输入了song[0], song[1], song[2]。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 命名参数的另一个作用是它们允许选择性使用参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk import *\n",
    "def freq_words(file, min=1, num=10):\n",
    "    text = open(file).read()\n",
    "    tokens = word_tokenize(text)\n",
    "    freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)\n",
    "    return freqdist.most_common(num)\n",
    "fw = freq_words('4.test.html', 4, 10)\n",
    "fw = freq_words('4.test.html', min=4, num=10)\n",
    "fw = freq_words('4.test.html', num=10, min=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 可选参数的另一个常见用途是作为标志使用。\n",
    "这里是同一个的函数的修订版本，如果设置了verbose标志将会报告其进展情况："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Opening 4.test.html\n",
      "Read in 11 characters\n",
      ". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . "
     ]
    }
   ],
   "source": [
    "def freq_words(file, min=1, num=10, verbose=False):\n",
    "    freqdist = FreqDist()\n",
    "    if verbose: print(\"Opening\", file)\n",
    "    with open(file) as f:\n",
    "        text = f.read()\n",
    "        if verbose: print(\"Read in %d characters\" % len(file))\n",
    "        for word in word_tokenize(text):\n",
    "            if len(word) >= min:\n",
    "                freqdist[word] += 1\n",
    "                if verbose and freqdist.N() % 100 == 0: print(\".\", sep=\"\",end = \" \")\n",
    "        if verbose: print\n",
    "        return freqdist.most_common(num)\n",
    "fw = freq_words(\"4.test.html\", 4 ,10, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 注意\n",
    "注意不要使用可变对象作为参数的默认值。这个函数的一系列调用将使用同一个对象，有时会出现离奇的结果，就像我们稍后会在关于调试的讨论中看到的那样。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 注意\n",
    "如果你的程序将使用大量的文件，它是一个好主意来关闭任何一旦不再需要的已经打开的文件。如果你使用with语句，Python会自动关闭打开的文件︰"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"4.test.html\") as f:\n",
    "    data = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.6 程序开发"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "编程是一种技能，需要获得几年的各种编程语言和任务的经验。\n",
    "\n",
    "**关键的高层次能力**：是算法设计及其在结构化编程中的实现。\n",
    "\n",
    "**关键的低层次的能力**包括熟悉语言的语法结构，以及排除故障的程序（不能表现预期的行为的程序）的各种诊断方法的知识。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、Python模块的结构"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与其他NLTK的模块一样，distance.py以一组注释行开始，包括一行模块标题和作者信息。\n",
    "\n",
    "（由于代码会被发布，也包括代码可用的URL、版权声明和许可信息。）\n",
    "\n",
    "接下来是模块级的文档字符串，三重引号的多行字符串，其中包括当有人输入help(nltk.metrics.distance)将被输出的关于模块的信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nDistance Metrics.\\n\\nCompute the distance between two items (usually strings).\\nAs metrics, they must satisfy the following three requirements:\\n\\n1. d(a, a) = 0\\n2. d(a, b) >= 0\\n3. d(a, c) <= d(a, b) + d(b, c)\\n'"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Natural Language Toolkit: Distance Metrics\n",
    "#\n",
    "# Copyright (C) 2001-2013 NLTK Project\n",
    "# Author: Edward Loper <edloper@gmail.com>\n",
    "#         Steven Bird <stevenbird1@gmail.com>\n",
    "#         Tom Lippincott <tom@cs.columbia.edu>\n",
    "# URL: <http://nltk.org/>\n",
    "# For license information, see LICENSE.TXT\n",
    "#\n",
    "\n",
    "\"\"\"\n",
    "Distance Metrics.\n",
    "\n",
    "Compute the distance between two items (usually strings).\n",
    "As metrics, they must satisfy the following three requirements:\n",
    "\n",
    "1. d(a, a) = 0\n",
    "2. d(a, b) >= 0\n",
    "3. d(a, c) <= d(a, b) + d(b, c)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、多模块程序\n",
    "一些程序汇集多种任务，例如从语料库加载数据、对数据进行一些分析、然后将其可视化。我们可能已经有了稳定的模块来加载数据和实现数据可视化。我们的工作可能会涉及到那些分析任务的编码，只是从现有的模块调用一些函数。4.7描述了这种情景。\n",
    "![4.1](./picture/4.1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、错误源头\n",
    "首先，输入的数据可能包含一些意想不到的字符。\n",
    "\n",
    "第二，提供的函数可能不会像预期的那样运作。\n",
    "\n",
    "第三，我们对Python语义的理解可能出错。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "not enough arguments for format string",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-54-3eb6bed1baa0>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"%s.%s.%02d\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;34m\"ph.d.\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"n\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m: not enough arguments for format string"
     ]
    }
   ],
   "source": [
    "print(\"%s.%s.%02d\" % \"ph.d.\", \"n\", 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ph.d..n.01\n"
     ]
    }
   ],
   "source": [
    "print(\"%s.%s.%02d\" % (\"ph.d.\", \"n\", 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 在函数中命名参数不能设置列表等对象\n",
    "程序的行为并不如预期，因为我们错误地认为在函数被调用时会创建默认值。然而，它只创建了一次，在Python解释器加载这个函数时。这一个列表对象会被使用，只要没有给函数提供明确的值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['omg', 'teh', 'teh', 'mat']\n",
      "['ur', 'on']\n",
      "['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat']\n"
     ]
    }
   ],
   "source": [
    "def find_words(text, wordlength, result=[]):\n",
    "    for word in text:\n",
    "        if len(word) == wordlength:\n",
    "            result.append(word)\n",
    "    return result\n",
    "\n",
    "print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) )\n",
    "\n",
    "print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 2, ['ur']) )\n",
    "\n",
    "print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) )  # 明显错误\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4、调试技术\n",
    "1.由于大多数代码错误是因为程序员的不正确的假设，你检测bug要做的第一件事是检查你的假设。通过给程序添加print语句定位问题，显示重要的变量的值，并显示程序的进展程度。\n",
    "\n",
    "2.解释器会输出一个堆栈跟踪，精确定位错误发生时程序执行的位置。\n",
    "\n",
    "3.Python提供了一个调试器，它允许你监视程序的执行，指定程序暂停运行的行号（即断点），逐步调试代码段和检查变量的值。你可以如下方式在你的代码中调用调试器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']\n",
      "> <string>(1)<module>()\n",
      "(Pdb) step\n",
      "--Call--\n",
      "> <ipython-input-58-510e58e9f243>(1)find_words()\n",
      "-> def find_words(text, wordlength, result=[]):\n",
      "(Pdb) step\n",
      "> <ipython-input-58-510e58e9f243>(2)find_words()\n",
      "-> for word in text:\n",
      "(Pdb) args\n",
      "text = ['dog']\n",
      "wordlength = 3\n",
      "result = ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']\n",
      "(Pdb) next\n",
      "> <ipython-input-58-510e58e9f243>(3)find_words()\n",
      "-> if len(word) == wordlength:\n",
      "(Pdb) b\n",
      "(Pdb) c\n"
     ]
    }
   ],
   "source": [
    "import pdb\n",
    "print(find_words(['cat'], 3)) # [_first-run]\n",
    "\n",
    "pdb.run(\"find_words(['dog'], 3)\") # [_second-run]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "它会给出一个提示(Pdb)，你可以在那里输入指令给调试器。输入help来查看命令的完整列表。输入step(或只输入s)将执行当前行然后停止。如果当前行调用一个函数，它将进入这个函数并停止在第一行。输入next(或只输入n)是类似的，但它会在当前函数中的下一行停止执行。break（或b）命令可用于创建或列出断点。输入continue（或c）会继续执行直到遇到下一个断点。输入任何变量的名称可以检查它的值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5、防御性编程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.考虑在你的代码中添加assert语句，指定变量的属性，例如assert(isinstance(text, list))。如果text的值在你的代码被用在一些较大的环境中时变为了一个字符串，将产生一个AssertionError，于是你会立即得到问题的通知。\n",
    "\n",
    "2.一旦你觉得你发现了错误，作为一个假设查看你的解决方案。在重新运行该程序之前尝试预测你修正错误的影响。如果bug不能被修正，不要陷入盲目修改代码希望它会奇迹般地重新开始运作的陷阱。相反，每一次修改都要尝试阐明错误是什么和为什么这样修改会解决这个问题的假设。如果这个问题没有解决就撤消这次修改。\n",
    "\n",
    "3.当你开发你的程序时，扩展其功能，并修复所有bug，维护一套测试用例是有益的。这被称为回归测试，因为它是用来检测代码“回归”的地方——修改代码后会带来一个意想不到的副作用是以前能运作的程序不运作了的地方。Python以doctest模块的形式提供了一个简单的回归测试框架。这个模块搜索一个代码或文档文件查找类似与交互式Python会话这样的文本块，这种形式你已经在这本书中看到了很多次。它执行找到的Python命令，测试其输出是否与原始文件中所提供的输出匹配。每当有不匹配时，它会报告预期值和实际值。有关详情，请查询在 documentation at http://docs.python.org/library/doctest.html 上的doctest文档。除了回归测试它的值，doctest模块有助于确保你的软件文档与你的代码保持同步。\n",
    "\n",
    "4.也许最重要的防御性编程策略是要清楚的表述你的代码，选择有意义的变量和函数名，并通过将代码分解成拥有良好文档的接口的函数和模块尽可能的简化代码。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.7 算法设计\n",
    "解决算法问题的一个重要部分是为手头的问题选择或改造一个合适的算法。有时会有几种选择，能否选择最好的一个取决于对每个选择随数据增长如何执行的知识。\n",
    "\n",
    "算法设计的另一种方法，我们解决问题通过将它转化为一个我们已经知道如何解决的问题的一个实例。例如，为了检测列表中的重复项，我们可以预排序这个列表，然后通过一次扫描检查是否有相邻的两个元素是相同的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、迭代与递归\n",
    "例如，假设我们有n个词，要计算出它们结合在一起有多少不同的方式能组成一个词序列。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 迭代\n",
    "如果我们只有一个词（n=1），只是一种方式组成一个序列。如果我们有2个词，就有2种方式将它们组成一个序列。3个词有6种可能性。一般的，n个词有n × n-1 × … × 2 × 1种方式（即n的阶乘）。我们可以将这些编写成如下代码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "def factorial1(n):\n",
    "    result = 1\n",
    "    for i in range(n):\n",
    "        result *= (i+1)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 递归\n",
    "假设我们有办法为n-1 不同的词构建所有的排列。然后对于每个这样的排列，有n个地方我们可以插入一个新词：开始、结束或任意两个词之间的n-2个空隙。因此，我们简单的将n-1个词的解决方案数乘以n的值。我们还需要基础案例，也就是说，如果我们有一个词，只有一个顺序。我们可以将这些编写成如下代码：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "def factorial2(n):\n",
    "    if n == 1:\n",
    "        return 1\n",
    "    else:\n",
    "        return n * factorial2(n-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "尽管递归编程结构简单，但它是有代价的。每次函数调用时，一些状态信息需要推入堆栈，这样一旦函数执行完成可以从离开的地方继续执行。出于这个原因，迭代的解决方案往往比递归解决方案的更高效。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、权衡空间与时间\n",
    "我们有时可以显著的加快程序的执行，通过建设一个辅助的数据结构，例如索引。4.10实现一个简单的电影评论语料库的全文检索系统。通过索引文档集合，它提供更快的查找。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building Index...\n",
      "query> so what does joe do\n",
      "Not found\n",
      "query> quit\n",
      "s funded by her mother . lucy quit working professionally 10\n",
      "erick . i disliked that movie quite a bit , but since \" prac\n",
      "t disaster . babe ruth didn't quit baseball after one season\n",
      "o-be fiance . i think she can quit that job and get a more r\n",
      " and rose mcgowan should just quit acting . she has no chari\n",
      "and get a day job . and don't quit it .                     \n",
      " kubrick , alas , should have quit while he was ahead . this\n",
      "everyone involved should have quit while they were still ahe\n",
      "l die . so what does joe do ? quit his job , of course ! ! w\n",
      "red \" implant . he's ready to quit the biz and get a portion\n",
      "hat he always recorded , they quit and become disillusioned \n",
      " admit that i ? ? ? ve become quite the \" scream \" fan . no \n",
      " again , the fact that he has quit his job to feel what it's\n",
      "school reunion . he has since quit his job as a travel journ\n",
      "ells one of his friends , \" i quit school because i didn't l\n",
      "ms , cursing off the boss and quitting his job ( \" today i q\n",
      "e , the arrival of the now ubiquitous videocassette . burt r\n",
      "in capitol city , that he has quit his job and hopes to open\n",
      "before his death at age 67 to quit filmmaking once a homosex\n",
      " - joss's explanation that he quit the priesthood because of\n",
      " is a former prosecutor , and quit because of tensions betwe\n"
     ]
    }
   ],
   "source": [
    "def raw(file):\n",
    "    with open(file) as f:\n",
    "        contents = f.read()\n",
    "        contents = re.sub(r\"<.*?>\", \" \", contents)\n",
    "        contents = re.sub(\"\\s+\", \" \", contents)\n",
    "        return contents\n",
    "\n",
    "def snippet(doc, term):\n",
    "    text = \" \"*30 + raw(doc) + \" \"*30\n",
    "    pos = text.index(term)\n",
    "    return text[pos - 30 : pos + 30]\n",
    "\n",
    "\n",
    "print(\"Building Index...\")\n",
    "files = nltk.corpus.movie_reviews.abspaths()\n",
    "idx = nltk.Index((w, f) for f in files for w in raw(f).split())\n",
    "\n",
    "\n",
    "query = \" \"\n",
    "while query != \"quit\":\n",
    "    query = input(\"query> \")\n",
    "    if query in idx:\n",
    "        for doc in idx[query]:\n",
    "            print(snippet(doc, query))\n",
    "    else:\n",
    "        print(\"Not found\")\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个更微妙的空间与时间折中的例子涉及使用整数标识符替换一个语料库的词符。我们为语料库创建一个词汇表，每个词都被存储一次的列表，然后转化这个列表以便我们能通过查找任意词来找到它的标识符。每个文档都进行预处理，使一个词列表变成一个整数列表。现在所有的语言模型都可以使用整数。见4.11中的内容，如何为一个已标注的语料库做这个的例子的列表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocess(tagged_corpus):\n",
    "    words = set()\n",
    "    tags = set()\n",
    "    for sent in tagged_corpus:\n",
    "        for word, tag in sent:\n",
    "            words.add(word)\n",
    "            tags.add(tag)\n",
    "    wm = dict((w, i) for (i, w) in enumerate(words))\n",
    "    tm = dict((t, i) for (i, t) in enumerate(tags))\n",
    "    return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 集合中的元素会自动索引，所以测试一个大的集合的成员将远远快于测试相应的列表的成员。\n",
    "空间时间权衡的另一个例子是维护一个词汇表。如果你需要处理一段输入文本检查所有的词是否在现有的词汇表中，词汇表应存储为一个集合，而不是一个列表。集合中的元素会自动索引，所以测试一个大的集合的成员将远远快于测试相应的列表的成员。\n",
    "\n",
    "我们可以使用timeit模块测试这种说法。Timer类有两个参数：一个是多次执行的语句，一个是只在开始执行一次的设置代码。我们将分别使用一个整数的列表[1]和一个整数的集合[2]模拟10 万个项目的词汇表。测试语句将产生一个随机项，它有50％的机会在词汇表中[3]。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.024326185999598238\n",
      "0.005347012998754508\n"
     ]
    }
   ],
   "source": [
    "from timeit import Timer\n",
    "vocab_size = 100000\n",
    "setup_list = \"import random; vocab = range(%d)\" % vocab_size          #[1]\n",
    "setup_set = \"import random; vocab = set(range(%d))\" % vocab_size   #[2]\n",
    "statement = \"random.randint(0, %d) in vocab\" % (vocab_size * 2)     #[3]\n",
    "print(Timer(statement, setup_list).timeit(1000))\n",
    "\n",
    "print(Timer(statement, setup_set).timeit(1000))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行1000 次链表成员资格测试总共需要2.8秒，而在集合上的等效试验仅需0.0037 秒，也就是说快了三个数量级！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、动态规划\n",
    "动态规划(Dynamic programming)是一种自然语言处理中被广泛使用的算法设计的一\n",
    "般方法。\n",
    "“programming”一词的用法与你可能想到的感觉不同,是规划或调度的意思。动态\n",
    "规划用于解决包含多个重叠的子问题的问题。不是反复计算这些子问题,而是简单的将它们\n",
    "的计算结果存储在一个查找表中。在本节的余下部分,我们将介绍动态规划,但在一个相当\n",
    "不同的背景:句法分析,下介绍。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pingala 是大约生活在公元前 5 世纪的印度作家,作品有被称为《Chandas Shastra》的\n",
    "梵文韵律专著。Virahanka 大约在公元 6 世纪延续了这项工作,研究短音节和长音节组合产\n",
    "生一个长度为 n 的旋律的组合数。短音节,标记为 S,占一个长度单位,而长音节,标记为\n",
    "L,占 2 个长度单位。Pingala 发现,例如:有 5 种方式构造一个长度为 4 的旋律:V 4 = {L\n",
    "L, SSL, SLS, LSS, SSSS}。请看,我们可以将 V 4 分成两个子集,以 L 开始的子集和以 S\n",
    "开始的子集,如(1)所示。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(1) V 4 =\n",
    "    LL, LSS\n",
    "        i.e. L prefixed to each item of V 2 = {L, SS}\n",
    "    SSL, SLS, SSSS\n",
    "        i.e. S prefixed to each item of V 3 = {SL, LS, SSS}\n",
    "V1 = { S}\n",
    "V0 = {\"\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有了这个观察结果,我们可以写一个小的递归函数称为 virahanka1()来计算这些旋\n",
    "律,如例 4-9 所示。请注意,要计算 V 4 ,我们先要计算 V 3 和 V 2 。但要计算 V 3 ,我们先要\n",
    "计算 V 2 和 V 1 。在(2)中描述了这种调用结构。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "例 4-9. 四种方法计算梵文旋律:\n",
    "(一)迭代(递归);\n",
    "(二)自底向上的动态规划;\n",
    "(三)自上而下的动态规划;\n",
    "(四)内置默记法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['SSSS', 'SSL', 'SLS', 'LSS', 'LL']\n",
      "['SSSS', 'SSL', 'SLS', 'LSS', 'LL']\n",
      "['SSSS', 'SSL', 'SLS', 'LSS', 'LL']\n",
      "['SSSS', 'SSL', 'SLS', 'LSS', 'LL']\n"
     ]
    }
   ],
   "source": [
    "def virahanka1(n):\n",
    "    if n == 0:\n",
    "        return [\"\"]\n",
    "    elif n == 1:\n",
    "        return [\"S\"]\n",
    "    else:\n",
    "        s = [\"S\" + prosody for prosody in virahanka1(n-1)]\n",
    "        l = [\"L\" + prosody for prosody in virahanka1(n-2)]\n",
    "    return s + l\n",
    "\n",
    "def virahanka2(n):\n",
    "    lookup = [[\"\"], [\"S\"]]\n",
    "    for i in range(n-1):\n",
    "        s = [\"S\" + prosody for prosody in lookup[i+1]]\n",
    "        l = [\"L\" + prosody for prosody in lookup[i]]\n",
    "        lookup.append(s + l)\n",
    "    return lookup[n]\n",
    "\n",
    "             \n",
    "def virahanka3(n, lookup={0:[\"\"], 1:[\"S\"]}):\n",
    "    if n not in lookup:\n",
    "        s = [\"S\" + prosody for prosody in virahanka3(n - 1)]\n",
    "        l = [\"L\" + prosody for prosody in virahanka3(n - 2)]\n",
    "        lookup[n] = s + l\n",
    "    return lookup[n]\n",
    "\n",
    "from nltk import memoize\n",
    "@memoize\n",
    "def virahanka4(n):\n",
    "    if n == 0:\n",
    "        return [\"\"]\n",
    "    elif n == 1:\n",
    "        return [\"S\"]\n",
    "    else:\n",
    "        s = [\"S\" + prosody for prosody in virahanka4(n - 1)]\n",
    "        l = [\"L\" + prosody for prosody in virahanka4(n - 2)]\n",
    "        return s + l\n",
    "\n",
    "print(virahanka1(4))\n",
    "\n",
    "print(virahanka2(4))\n",
    "\n",
    "print(virahanka3(4))\n",
    "\n",
    "print(virahanka4(4))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 法1.递归（迭代）\n",
    "正如你可以看到,V 2 计算了两次。这看上去可能并不像是一个重大的问题,但事实证\n",
    "明,当 n 变大时使用这种递归技术计算 V 20 ,我们将计算 V 2 4,181 次;对 V 40 我们将计算 V 2\n",
    "63245986 次!\n",
    "### 法2.动态规划（自下而上）\n",
    "函数 virahanka2()实现动态规划方法解决这个问题。它的工作原\n",
    "理是使用问题的所有较小的实例的计算结果填充一个表格(叫做 lookup),一旦我们得到\n",
    "了我们感兴趣的值就立即停止。此时,我们读出值,并返回它。最重要的是,每个子问题只\n",
    "计算了一次。\n",
    "### 法3.动态规划（自上而下）\n",
    "请注意,virahanka2()所采取的办法是解决较大问题前先解决较小的问题。因此,这\n",
    "被称为自下而上的方法进行动态规划。不幸的是,对于某些应用它还是相当浪费资源的,因\n",
    "为它计算的一些子问题在解决主问题时可能并不需要。采用自上而下的方法进行动态规划可\n",
    "避免这种计算的浪费,如例 4-9 中函数 virahanka3()所示。不同于自下而上的方法,这种\n",
    "方法是递归的。通过检查是否先前已存储了结果,它避免了 virahanka1()的巨大浪费。如\n",
    "果没有存储,就递归的计算结果,并将结果存储在表中。最后一步返回存储的结果。\n",
    "### 法4.内置默记法\n",
    "最后一\n",
    "种方法,invirahanka4(),使用一个 Python 的“装饰器”称为默记法(memoize),它会\n",
    "做 virahanka3()所做的繁琐的工作而不会搞乱程序。这种“默记”过程中会存储每次函数\n",
    "调用的结果以及使用到的参数。如果随后的函数调用了同样的参数,它会返回存储的结果,\n",
    "而不是重新计算。(这方面的 Python 语法超出了本书的范围。)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.8 Python 库的样例"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "例 4-10. 布朗语料库中不同部分的情态动词频率。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'news': [93, 86, 66, 38, 50, 389], 'religion': [82, 59, 78, 12, 54, 71], 'hobbies': [268, 58, 131, 22, 83, 264], 'government': [117, 38, 153, 13, 102, 244], 'adventure': [46, 151, 5, 58, 27, 50]}\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEICAYAAABF82P+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3Xt8VNW9///Xh0gJAoIi8kVBgxZRIRAhQSiC0VZRq0K1iNQqlCrHqket1q/Uc07F2+Orv1K1aKuiKIiiqBxvVE/rLchFCsQGlItcajyAVAEFuRvg8/tjr4QhTJIJZGZyeT8fj3lkz9pr7/3ZM5P9mbXXnrXN3RERESmvUboDEBGR2kkJQkRE4lKCEBGRuJQgREQkLiUIERGJSwlCRETiUoKQOsPM2prZB2a22cz+UM1ljzWzLWaWkaz4qsvM3My+n0C9rFD3kBTFlW9mq2tgPcPNbGZNxCTpoQRRC5hZsZltDwew0sfR6Y6rFhoJrAcOc/dbys80s/ZmNtXM1pvZJjP7xMyGA7j7/7p7c3ffXd2NmllBOEB3L1f+SijPP7DdOXhm9j9mdlec8oFm9q9UJZVUMLOzzez98AVhg5kVmdltZpaZ7tjqKyWI2uPCcAArfXxRvkJ9+mc/QMcBi73iX3dOAlaFeq2BK4Ava2jby4ArS5+YWWugD7CuhtZ/oCYCPzczK1d+BfCcu++qzspq62fMzAYDLwOTgePcvTUwBGgPdEjC9mrl65By7q5Hmh9AMfCjOOVZgAO/BP4X+CCU9wZmAxuBBUB+zDIdgenAZuBt4BHg2TAvH1hd0baJvjCMAlYCG4AXgSPKxTIsxLIe+I+Y9WQAt4dlNwOFRP+4fwL+UG6brwO/ruC1+AEwD9gU/v4glE8ASoDvgC0VvF5bgJwK1lsa/yHAEcBqoqQM0BxYAVxZwbIFwO/CMhmh7Hrg0VCWH8qaAA8BX4THQ0CTmPXcCqwN80aEeL4f5v0Y+AfwLVGSGx0v9jixNQ2vVf+YssOBHUD3mLjGhPftS+AxoGnsZwK4DfgXUZItLbs9vM/FwOUx6z8fWBze5zXAbyp43YYDs4g+g5uApcAPw7zBQGG5+jcDr8VZj4XX5JYq/o8O5vM7migBPRveg6sqW19DeaQ9AD0SShDPAM3CweCY8GE9P3yAzw7P24RlPgQeCAeF/uGfONEEcSMwh+hbWRPgceD5crE8EeLoDuwETg7zbwU+BjqHf+juRN/iexEdEBuFekcC24C2cfb3COAbom+/hwBDw/PWYf4E4J5KXsd3wgHpMuDYCl7LQ8Lzc4gOiEeFfXq5kvUWhAPG34DzQtlcohZEbIK4K7x+RwFtiJL43WHeuUQH567hvZzMvgkiH8gO72m3UHdQvNjjxPcE8GTM838DimKeP0iUlI8AWgBvAP8vZru7gPvDe940pqz0c3QGsBXoHJZZC/QL04cDPSqIa3hYz6+BxkTf+DeFOJoAX5d+fkL9fwCXxFnPSWH/s6r4PzqYz+9ooi8gg8J70LSy9TWUR9oD0KPsIL2FqEWwEXg1lJd+qI+PqXsbMKnc8n8l+mZ0bPiHbBYzbzKJJ4glhG944Xm78E9zSEws7WPmzwUuC9OfAgMr2L8lwNlh+nrgzQrqXQHMLVf2ITA8TE+g8gRxOHAfsAjYDRQBeeVey0Ni6j9MlNTWEJJQBestIEoQPweeDwesZWFebIJYCZwfs9wAoDhMPwXcFzPvRGISRJxtPgQ8WFHs5eqeHj43meH5LEILjShZbwVOiKnfB/gs5jPxXemyMWXlP0cvAv8Vpv+XKAkdVsXnejjRlwMr95m5Ikw/CtwbprsQfRloUsH+ebkYXwj7vC1mfQfz+R1NaKGX+9zGXV8yjwe16aE+iNpjkLu3Co9B5eatipk+DhhsZhtLH0T/QO2Ao4Fv3H1rTP3PqxHDccArMetdQnSgbRtT518x09uITs9AdDppZQXrnUh0cCX8nVRBvaPjxPs5UaupSu7+jbuPcvcuIeYi4NU45+dLjSP6Rj/B3TcksIn/Bs4iSnLx9qF8/J+HstJ5q8rNK2Nmp4UO2HVmtgm4hqi1VSV3n0l0ymSQmZ1A1GqbHGa3AQ4FCmPe1/8J5aXWufuOcquN9zkq3ZdLiFqwn5vZdDPrU0l4azwcXeOsZyLws/D+XAG86O4746yj9L1pF7PPl7l7K+AjotObcHCfX9j3/Ul0ffWaEkTdEPsPtoqoBdEq5tHM3e8javofbmbNYuofGzO9lehgAUC45DP2QLGK6BRK7Loz3X1NAjGuAk6oYN6zwMBwFdDJwKsV1PuC6J8y1rFE3/Crxd3XE513P5rolMY+wr6PIzp9d20il5u6+zbgLeBXxE8Q5eM/NpRB9N50KDcv1mSi00Ad3L0lUT9BRYktnmeIOtF/DvzV3Us759cD24EuMe9pS3ePPTA6+4v3OfoCwN3nuftAolNprxK1LipyTLkEHbueOUStl37Az6j4i8OnRJ+BiyvZDhzc5xf2fx0Odn11nhJE3fMscKGZDTCzDDPLDNett3f3z4H5wJ1m9j0zOx24MGbZZUCmmf3YzBoD/0l0brXUY8C9ZnYcgJm1MbOBCcb1JHC3mXWySLdwpQ/uvpqow3kSMNXdt1ewjjeBE83sZ2Z2iJkNAU4BpiUSgJndb2Zdw7ItiA7kKypoHdxOdEAYAfweeCbB30jcDpzh7sVx5j0P/Gd43Y4k6th+Nsx7ERhuZqeY2aHAHeWWbQF87e47zKwX0QGzOp4BfgRcTfTNHAB330N03v1BMzsKwMyOMbMBCayz9HPUD7gAeCk8v9zMWrp7CVGH7p5K1nEUcIOZNQ5XIp1M9D7Hxv0IUBJaQvsJ+3ALcIeZXW1mh4fPWCf2/TZ/MJ/feGp6fXWOEkQd4+6rgIFEB6p1RN9ybmXve/kz4DSiDsA7iP4BS5fdBFxLdDBfQ9SiiP1B1B+JvsX+zcw2E3XQnZZgaA8QHQT/RnTQGE/U0VdqIlEnbEXfEgkH8guIDgYbgP8LXBBaA4k4FHiF6Nz0P4m+zV9UvpKZ9SS6YuZKj34XcT9RshhV1Qbc/YuKDmTAPUQJeiFR38ZHoQx3f4uoX+E9oium3iu37LXAXeF1/x2VfyuPF1cxUad4M6L3MNZtYZtzzOxbos78zlWs8l9EfQJfAM8B17j70jDvCqA4rOsa4PJK1vN3oBNRS+Ze4KflEvYkotN8z8ZZNnb/pgCXErWQVoX1vUjUCnwpVDuYz288Nb2+Osf2PT0o9Y2ZjSbqCP15VXWTHEd/ooPAca4PnQRm1hT4iuhKqOXpjkf2pRaEJF04nXUj0aWYSg4S61fAPCWH2km/FpSkMrOTiU67LAB+keZwpBYxs2KijvjyV+1JLaFTTCIiEpdOMYmISFx1+hTTkUce6VlZWekOQ0SkTiksLFzv7m2qqlenE0RWVhbz589PdxgiInWKmSU0wkLSTzGFH3P9w8ymhecdzezvZrbCzKaY2fdCeZPwfEWYn5Xs2EREpGKp6IO4kWgMk1L3Ew1C9n2iH+L8MpT/kmj8l+8TjT55fwpiExGRCiQ1QZhZe6Jx7p8Mz41osLOXQ5WJ7L3EbSB7hwh4GfhhJYOsiYhIkiW7D+IhouESWoTnrYGNvvcuV6vZO1LnMYTRFN19VxjRsjXRT+rLmNlIoltPcuyx5cc7g5KSElavXs2OHeUHp5RUyMzMpH379jRu3DjdoYjIQUpagjCzC4Cv3L3QavCeve4+jmj8FXJzc/f7Ecfq1atp0aIFWVlZqAGSWu7Ohg0bWL16NR07dkx3OCJykJJ5iqkvcFH4teQLRKeW/gi0sr33e23P3qGc1xCGQw7zW7J3HPiE7dixg9atWys5pIGZ0bp1a7XeROqJpCUId/+tu7d39yyiW0C+5+6XA+8DPw3VhgGvhenXw3PC/PcOdNweJYf00WsvUn+k45fUtwE3m9kKoj6G8aF8PNA6lN9MAkMvi4hI8qTkh3LuXkB0X1/c/Z9Et0QsX2cHMLjGN17T32g1dpWINBB1+pfUIiI1JpEvkw3sC6IG60uC4uJiTj75ZK6++mq6dOnCOeecw/bt21m5ciXnnnsuPXv2pF+/fixdupTdu3fTsWNH3J2NGzeSkZHBBx98AED//v1Zvnw506dPJycnh5ycHE499VQ2b96c5j0UkYZACSJJli9fznXXXceiRYto1aoVU6dOZeTIkTz88MMUFhYyZswYrr32WjIyMujcuTOLFy9m5syZ9OjRgxkzZrBz505WrVpFp06dGDNmDH/6058oKipixowZNG3atOoAREQOkk4xJUnHjh3JyckBoGfPnhQXFzN79mwGD97bzbJz504A+vXrxwcffMBnn33Gb3/7W5544gnOOOMM8vLyAOjbty8333wzl19+ORdffDHt27dP/Q6JSIOjFkSSNGnSpGw6IyODr7/+mlatWlFUVFT2WLIkGqKqf//+zJgxg7lz53L++eezceNGCgoK6NevHwCjRo3iySefZPv27fTt25elS5fG3aaISE1SgkiRww47jI4dO/LSSy8B0a+OFyxYAECvXr2YPXs2jRo1IjMzk5ycHB5//HH69+8PwMqVK8nOzua2224jLy9PCUJEUqL+Jwj3mn0chOeee47x48fTvXt3unTpwmuvRb8RbNKkCR06dKB3795AdMpp8+bNZGdnA/DQQw/RtWtXunXrRuPGjTnvvPMO7jUREUlAnb4ndW5urpe/YdCSJUs4+eST0xSRgN4DqaMa0GWuZlbo7rlV1av/LQgRETkgShAiIhKXEoSIiMSlBCEiInEpQYiISFxKECIiEle9H2rD7qzZ4b79jpq7zC0/P58xY8aQm5vL+eefz+TJk2nVqlWF9X/3u9/Rv39/fvSjH9VYDCIiFan3CSLd3B13p1Gjyhtrb775ZpXruuuuu2oqLBGRKukUUxIUFxfTuXNnrrzySrp27cqkSZPo06cPPXr0YPDgwWzZsmW/ZbKysli/fj0Ad999N507d+b0009n6NChjBkzBoDhw4fz8ssvA/Duu+9y6qmnkp2dzYgRI8oG/svKyuKOO+6gR48eZGdna1gOETlgSUsQZpZpZnPNbIGZLTKzO0P5BDP7zMyKwiMnlJuZjTWzFWa20Mx6JCu2VFi+fDnXXnst06dPZ/z48bzzzjt89NFH5Obm8sADD1S43Lx585g6dSoLFizgrbfeovwvxQF27NjB8OHDmTJlCh9//DG7du3i0UcfLZt/5JFH8tFHH/GrX/2qLLmIiFRXMlsQO4Gz3L07kAOca2a9w7xb3T0nPIpC2XlAp/AYCTy63xrrkOOOO47evXszZ84cFi9eTN++fcnJyWHixIl8/vnnFS43a9YsBg4cSGZmJi1atODCCy/cr86nn35Kx44dOfHEEwEYNmxY2U2GAC6++GJg7zDjIiIHIml9EB4N8lR6LqVxeFTWwzsQeCYsN8fMWplZO3dfm6wYk6lZs2ZA1Adx9tln8/zzz6ds26VDjWdkZLBr166UbVdE6pek9kGYWYaZFQFfAW+7+9/DrHvDaaQHzaz0xgnHAKtiFl8dysqvc6SZzTez+evWrUtm+DWid+/ezJo1ixUrVgCwdetWli1bVmH9vn378sYbb7Bjxw62bNnCtGnT9qvTuXNniouLy9Y5adIkzjjjjOTsgIg0WEm9isnddwM5ZtYKeMXMugK/Bf4FfA8YB9wGJHx5jruPC8uRm5tb5TWnNXlZ6oFo06YNEyZMYOjQoWUdyffcc0/Z6aHy8vLyuOiii+jWrRtt27YlOzubli1b7lMnMzOTp59+msGDB7Nr1y7y8vK45pprkr4vItKwpGy4bzP7HbDN3cfElOUDv3H3C8zscaDA3Z8P8z4F8is7xVRfh/vesmULzZs3Z9u2bfTv359x48bRo0fd6bOvD++BNEAa7ns/ybyKqU1oOWBmTYGzgaVm1i6UGTAI+CQs8jpwZbiaqTewqa72PxyskSNHkpOTQ48ePbjkkkvqVHIQkfojmaeY2gETzSyDKBG96O7TzOw9M2sDGFAElJ4beRM4H1gBbAN+kcTYarXJkyenOwQRkaRexbQQODVO+VkV1HfgumTFIyIi1aNfUouISFxKECIiEpcShIiIxFXvE4RZzT4SUVxcTNeuXROOMT8/P+6YS6NHj447ltIXX3zBT3/604TXLyJyIOp9gqiPjj766LJRXUVEkkUJIkl2797N1VdfTZcuXTjnnHPYvn07RUVF9O7dm27duvGTn/yEb775pqz+pEmTyMnJoWvXrsydO7esfMGCBfTp04dOnTrxxBNPAPu2UHbv3s2tt95KXl4e3bp14/HHHwdg7dq19O/fv2ydM2bMSOHei0h9oASRJMuXL+e6665j0aJFtGrViqlTp3LllVdy//33s3DhQrKzs7nzzjvL6m/bto2ioiL+/Oc/M2LEiLLyhQsX8t577/Hhhx9y11138cUXX+yznfHjx9OyZUvmzZvHvHnzeOKJJ/jss8+YPHkyAwYMoKioiAULFpCTk5OyfReR+kF3lEuSjh07lh2Ue/bsycqVK9m4cWPZoHrDhg1j8ODBZfWHDh0KQP/+/fn222/ZuHEjAAMHDqRp06Y0bdqUM888k7lz5+5zsP/b3/7GwoULy045bdq0ieXLl5OXl8eIESMoKSlh0KBBShAiUm1KEElSOuQ2RMNulx7wK2LlesBLn1dUXsrdefjhhxkwYMB+6/zggw/4y1/+wvDhw7n55pu58sorq7UPItKw6RRTirRs2ZLDDz+8rC+g/BDdU6ZMAWDmzJm0bNmybATX1157jR07drBhwwYKCgrIy8vbZ70DBgzg0UcfpaSkBIBly5axdetWPv/8c9q2bcvVV1/NVVddxUcffZSK3RSReqTetyBq0+CLEydO5JprrmHbtm0cf/zxPP3002XzMjMzOfXUUykpKeGpp54qK+/WrRtnnnkm69ev57/+6784+uij97lL3FVXXUVxcTE9evTA3WnTpg2vvvoqBQUF/P73v6dx48Y0b96cZ555JpW7KiL1QMqG+06G+jrcd12n90DqJA33vR+dYhIRkbiUIEREJC4lCBERiUsJQkRE4lKCEBGRuJQgREQkrqT9DsLMMoEPgCZhOy+7+x1m1hF4AWgNFAJXuPt3ZtYEeAboCWwAhrh78UHHUVBwsKvYh+fn1+j66qKNGzcyefJkrr322nSHIiJJlMwWxE7gLHfvDuQA55pZb+B+4EF3/z7wDfDLUP+XwDeh/MFQT6qwa9eulG9z48aN/PnPf075dkUktZKWIDyyJTxtHB4OnAWU3sxgIjAoTA8Mzwnzf2jlBx6qQ+6++246d+7M6aefztChQxkzZkzc4b6XLl1Kr169ypYrLi4mOzsbgMLCQs444wx69uzJgAEDWLt2LRDdYOimm24iNzeXP/7xjwwfPpwbbriBH/zgBxx//PFlA/cVFBRwxhlnMHDgQI4//nhGjRrFc889R69evcjOzmblypUArFu3jksuuYS8vDzy8vKYNWsWEN2waMSIEeTn53P88cczduxYAEaNGsXKlSvJycnh1ltvTdlrKiKpldQ+CDPLMLMi4CvgbWAlsNHdS7/2rgaOCdPHAKsAwvxNRKehyq9zpJnNN7P569atS2b4B2zevHlMnTqVBQsW8NZbb5XdLS7ecN8nnXQS3333HZ999hkQjck0ZMgQSkpK+Pd//3defvllCgsLGTFiBP/xH/9Rto3vvvuO+fPnc8sttwDR/R9mzpzJtGnTGDVqVFm9BQsW8Nhjj7FkyRImTZrEsmXLmDt3LldddRUPP/wwADfeeCO//vWvy+K+6qqrypZfunQpf/3rX5k7dy533nknJSUl3HfffZxwwgkUFRXx+9//Pumvp4ikR1LHYnL33UCOmbUCXgFOqoF1jgPGQTTUxsGuLxlmzZrFwIEDyczMJDMzkwsvvJCtW7dWONz3pZdeypQpUxg1ahRTpkxhypQpfPrpp3zyySecffbZQHRjoHbt2pVtY8iQIftsc9CgQTRq1IhTTjmFL7/8sqw8Ly+vbLkTTjiBc845B4Ds7Gzef/99AN555x0WL15ctsy3337Lli1R4+/HP/4xTZo0oUmTJhx11FH7rFtE6reUDNbn7hvN7H2gD9DKzA4JrYT2wJpQbQ3QAVhtZocALYk6q+u9IUOGMHjwYC6++GLMjE6dOvHxxx/TpUsXPvzww7jLNGvWbJ/nscOLx46vFVveqFGjsueNGjUq67/Ys2cPc+bMITMzc7/tlB+2PB19HiKSHkk7xWRmbULLATNrCpwNLAHeB34aqg0DXgvTr4fnhPnveR0dSbBv37688cYb7Nixgy1btjBt2jSaNWtW4XDfJ5xwAhkZGdx9991lLYPOnTuzbt26sgRRUlLCokWLkhLvOeecU3a6CaCoqKjS+i1atGDz5s1JiUVEao9ktiDaARPNLIMoEb3o7tPMbDHwgpndA/wDGB/qjwcmmdkK4GvgspoIIh2Xpebl5XHRRRfRrVs32rZtS3Z2Ni1btqx0uO8hQ4Zw6623lvVFfO973+Pll1/mhhtuYNOmTezatYubbrqJLl261Hi8Y8eO5brrrqNbt27s2rWL/v3789hjj1VYv3Xr1vTt25euXbty3nnnqR9CpJ7ScN9JsmXLFpo3b862bdvo378/48aNo0ePHukOKyVqy3sgUi0a7ns/9f6GQekycuRIFi9ezI4dOxg2bFiDSQ4iUn8oQSTJ5MmT0x2CiMhB0VhMIiISlxKEiIjEpQQhIiJxKUGIiEhc9b6TusAKanR9+Z5fI+uZMGEC8+fP55FHHqmR9UE00N/s2bP52c9+VmPrFJGGSy2IeqS4uPiArp7avXt3EqIRkbpOCSJJBg0aRM+ePenSpQvjxo0D4Omnn+bEE0+kV69eZUNqb9q0ieOOO449e/YAsHXrVjp06EBJSQkrV67k3HPPpWfPnvTr14+lS5cCVDi896hRo5gxYwY5OTk8+OCDTJgwgeuvv74spgsuuICCcAOl5s2bc8stt9C9e3c+/PDDCocWF5GGSwkiSZ566ikKCwuZP38+Y8eOZc2aNdxxxx3MmjWLmTNnlo2e2rJlS3Jycpg+fToA06ZNY8CAATRu3JiRI0fy8MMPU1hYyJgxY/a5g1u84b3vu+8++vXrR1FREb/+9a8rjW/r1q2cdtppLFiwgNNOO63SocVFpGGq930Q6TJ27FheeeUVAFatWsWkSZPIz8+nTZs2QDT20rJly8qmp0yZwplnnskLL7zAtddey5YtW5g9e3bZkOAAO3fuLJuuaHjvRGVkZHDJJZcAVDm0uIg0TEoQB6ncUFAAFBYW8Oqr7/CnP31IZuah/OY3+Zx00kn73HMh1kUXXcTtt9/O119/TWFhIWeddRZbt26lVatWFY6sWtHw3rEOOeSQslNXADt27CibzszMJCMjo2z5yoYWF5GGSaeYkmDLlk20aHE4mZmHUly8lDlz5rB9+3amT5/Ohg0bKCkp4aWXXiqr37x5c/Ly8rjxxhu54IILyMjI4LDDDqNjx45l9dydBQsWVLrd8sNwZ2VlUVRUxJ49e1i1ahVz586Nu1wqhxYXkbqj3rcgauqy1Oro0+dcpk59jMGDT+a44zrTu3dv2rVrx+jRo+nTpw+tWrUiJydnn2VKbxpU2okM8Nxzz/GrX/2Ke+65h5KSEi677DK6d+9e4Xa7detGRkYG3bt3Z/jw4dx000107NiRU045hZNPPrnCAQNTObS4SH1X1aCwdWlAWA33fZDinWIqL7fKQXXrFw33LXVSDQ33XRcSRKLDfesUk4iIxKUEISIicdXLBFGXT5vVdXrtReqPpCUIM+tgZu+b2WIzW2RmN4by0Wa2xsyKwuP8mGV+a2YrzOxTMxtwINvNzMxkw4YNOlClgbuzYcMGMjMz0x2KiNSAZF7FtAu4xd0/MrMWQKGZvR3mPejuY2Irm9kpwGVAF+Bo4B0zO9HdqzVQUPv27Vm9ejXr1q2rgV2o2vr1VddZsiT5cdQWmZmZtG/fPt1hiEgNSFqCcPe1wNowvdnMlgDHVLLIQOAFd98JfGZmK4BeQLV+vdW4cWM6dux4gFFX3ymnVF1HjRkRqYsSOsVkZtkHsxEzywJOBf4eiq43s4Vm9pSZHR7KjgFWxSy2mjgJxcxGmtl8M5ufqlaCiEhDlGgfxJ/NbK6ZXWtmLauzATNrDkwFbnL3b4FHgROAHKIWxh+qsz53H+fuue6eWzqukYiI1LyEEoS79wMuBzoQ9SVMNrOzq1rOzBoTJYfn3P2/w7q+dPfd7r4HeILoNBLAmrD+Uu1DmYiIpEHCfRDuvtzM/hOYD4wFTjUzA24vPfjHCvPGA0vc/YGY8nahfwLgJ8AnYfp1YLKZPUDUSd0JiD94kIhIHWUxw+lUxPPzkx5HIhJKEGbWDfgF8GPgbeDCcHXS0USdyPslCKAvcAXwsZmVDkl6OzDUzHIAB4qBfwNw90Vm9iKwmOgKqOuqewWTiIjUnERbEA8DTxK1FraXFrr7F6FVsR93nwnEG5XkzYo24u73AvcmGJOIiCRRognix8D20m/0ZtYIyHT3be4+KWnRiYhI2iR6FdM7QNOY54eGMhERqacSTRCZ7r6l9EmYPjQ5IYmISG2QaILYamZld5sxs57A9krqi4hIHZdoH8RNwEtm9gVRx/P/AYYkLSoREUm7hBKEu88zs5OAzqHoU3cvSV5YIiKSbtUZrC8PyArL9DAz3P2ZpEQlIiJpl+gP5SYRjZ9UBJT+eM0BJQgRkXoq0RZELnCK6y48IiINRqJXMX1C1DEtIiINRKItiCOBxWY2F9hZWujuFyUlKhERSbtEE8ToZAYhIiK1T6KXuU43s+OATu7+jpkdCmQkNzQREUmnRG85ejXwMvB4KDoGeDVZQYmISPol2kl9HdH9Hb6F6OZBwFHJCkpERNIv0QSx092/K31iZocQ/Q5CRETqqUQTxHQzux1oGu5F/RLwRvLCEhGRdEs0QYwC1gEfE90i9E0g7p3kSplZBzN738wWm9kiM7sxlB9hZm+b2fLw9/BQbmY21sxWmNnC2NFjRUQk9RK9imkP8ER4JGoXcEu8YalLAAANt0lEQVS4d3ULoNDM3gaGA++6+31mNooo+dwGnAd0Co/TgEfDXxERSYNEx2L6jDh9Du5+fEXLuPtaYG2Y3mxmS4iufhoI5IdqE4ECogQxEHgmDOcxx8xamVm7sB4REUmx6ozFVCoTGAwckehGzCwLOBX4O9A25qD/L6BtmD4GWBWz2OpQtk+CMLORwEiAY489NtEQRESkmhLqg3D3DTGPNe7+EPDjRJY1s+bAVOAmd/+23Hqdal4N5e7j3D3X3XPbtGlTnUVFRKQaEj3FFNth3IioRVHlsmbWmCg5POfu/x2Kvyw9dWRm7YCvQvkaoEPM4u1DmYiIpEGip5j+EDO9CygGLq1sATMzYDywxN0fiJn1OjAMuC/8fS2m/Hoze4Goc3qT+h9ERNIn0auYzjyAdfcFrgA+NrOiUHY7UWJ40cx+CXzO3kTzJnA+sALYBvziALYpIiI1JNFTTDdXNr9cC6G0bCZgFSzywzj1nWhIDxERqQWqcxVTHtFpIIALgbnA8mQEJSIi6ZdogmgP9HD3zQBmNhr4i7v/PFmBiYhIeiU61EZb4LuY59+x9/cLIiJSDyXagngGmGtmr4Tng4h+BS0iIvVUolcx3WtmbwH9QtEv3P0fyQtLRETSLdFTTACHAt+6+x+B1WbWMUkxiYhILZDoLUfvIBpQ77ehqDHwbLKCEhGR9Eu0BfET4CJgK4C7fwG0SFZQIiKSfokmiO9iB9Yzs2bJC0lERGqDRBPEi2b2ONDKzK4G3qF6Nw8SEZE6JtGrmMaEe1F/C3QGfufubyc1MhERSatEhuzOAN4JA/YpKYiINBBVnmJy993AHjNrmYJ4RESklkj0l9RbiIbtfptwJROAu9+QlKhERCTtEk0Q/x0eIiLSQFSaIMzsWHf/X3fXuEsiIg1MVX0Qr5ZOmNnUJMciIiK1SFUJIvaOcMcnMxAREaldqkoQXsF0lczsKTP7ysw+iSkbbWZrzKwoPM6PmfdbM1thZp+a2YDqbEtERGpeVZ3U3c3sW6KWRNMwTXju7n5YJctOAB4hupdErAfdfUxsgZmdAlwGdAGOBt4xsxPDJbYiSWUFBVXW8fz8pMchUttUmiDcPeNAV+zuH5hZVoLVBwIvuPtO4DMzWwH0Aj480O1L+hRYQZV18j0/6XGIyMGpzv0gasr1ZrYwnII6PJQdA6yKqbM6lO3HzEaa2Xwzm79u3bpkxyoi0mClOkE8CpwA5ABrgT9UdwXuPs7dc909t02bNjUdn4iIBClNEO7+pbvvdvc9RKPB9gqz1gAdYqq2D2UiIpImKU0QZtYu5ulPgNIrnF4HLjOzJuFWpp2AuamMTURE9pXoUBvVZmbPA/nAkWa2GrgDyDezHKJLZouBfwNw90Vm9iKwGNgFXKcrmERE0itpCcLdh8YpHl9J/XuBe5MVj4iIVE86rmISEZE6QAlCRETiUoIQEZG4lCBERCQuJQip18yqfohIfEoQIiISlxKEiIjEpQQhIiJxKUGIiEhcShAiIhKXEoSIiMSlBCEiInElbbA+2auqex7rfsciUhupBSEiInEpQYiISFxKECIiEpcShIiIxJW0BGFmT5nZV2b2SUzZEWb2tpktD38PD+VmZmPNbIWZLTSzHsmKS0REEpPMFsQE4NxyZaOAd929E/BueA5wHtApPEYCjyYxLhERSUAy70n9gZlllSseCOSH6YlAAXBbKH/G3R2YY2atzKydu69NVnwiUjdUdZk46FLxZEl1H0TbmIP+v4C2YfoYYFVMvdWhTERE0iRtndShteDVXc7MRprZfDObv27duiREJiIikPoE8aWZtQMIf78K5WuADjH12oey/bj7OHfPdffcNm3aJDVYEZGGLNUJ4nVgWJgeBrwWU35luJqpN7BJ/Q8iIumVtE5qM3ueqEP6SDNbDdwB3Ae8aGa/BD4HLg3V3wTOB1YA24BfJCuumACrruPVPgMmIlJvJPMqpqEVzPphnLoOXJesWEREpPr0S2oREYlLCUJEROLS/SDqk6r6VdSnIiLVoBaEiIjEpRZEJezOBK50qv5v/UQkRpUXFL6fkjAkDrUgREQkLrUgpM5SC08kudSCEBGRuJQgREQkLiUIERGJS30QIiIJamj9XmpBiIhIXEoQIiISlxKE1F5mlT9EJKmUIEREJC4lCBERiUtXMTUgiVyB4XfUnyswROTgKEHIPnQn1oahwAoqnZ/v+SmJQ2q3tCQIMysGNgO7gV3unmtmRwBTgCygGLjU3b9JR3wiIpLeFsSZ7r4+5vko4F13v8/MRoXnt6UnNBGpCQ3th2X1TW3qpB4ITAzTE4FBaYxFRKTBS1cLwoG/mZkDj7v7OKCtu68N8/8FtI23oJmNBEYCHHvssamIVUTiSaTDanTSo6iXquojgtT0E6UrQZzu7mvM7CjgbTNbGjvT3T0kj/2EZDIOIDc3V21TEZEkSUuCcPc14e9XZvYK0Av40szauftaM2sHfJWO2NKhtnxbEBGJlfI+CDNrZmYtSqeBc4BPgNeBYaHaMOC1VMcmIiJ7paMF0RZ4xaLzl4cAk939f8xsHvCimf0S+By4NA2xiYhIkPIE4e7/BLrHKd8A/DDV8YiISHy16TJXERGpRTTUhlSbFRRUOv/91IRR92gcE6lj1IIQEZG4lCBERCQuJQgREYlLCUJEROJSJ7WI1HkajSA51IIQEZG4lCBERCQunWISqUP0UwpJJbUgREQkLiUIERGJSwlCRETiUoIQEZG41EktUovYnVX1QlfdA13VYIqgARUlMWpBiIhIXEoQIiISlxKEiIjEVev6IMzsXOCPQAbwpLvfl+aQRKoc60fj/Eh9VKtaEGaWAfwJOA84BRhqZqekNyoRkYapViUIoBewwt3/6e7fAS8AA9Mck4hIg2ReiwZuMbOfAue6+1Xh+RXAae5+fUydkcDI8LQz8GkKQzwSWJ/C7aVCfdsn7U/tpv2pHY5z9zZVVap1fRBVcfdxwLh0bNvM5rt7bjq2nSz1bZ+0P7Wb9qduqW2nmNYAHWKetw9lIiKSYrUtQcwDOplZRzP7HnAZ8HqaYxIRaZBq1Skmd99lZtcDfyW6zPUpd1+U5rBipeXUVpLVt33S/tRu2p86pFZ1UouISO1R204xiYhILaEEISIicSlByD7MLMvMPqlgXoGZ1dtL+tLNzC4ys1FV1Mk3s2kVzLvJzA5NTnSpU5f2w8zeNLNWYXpL+Fvh/1BdowQhUku4++sHOfbYTUCdOLBWoc7sh7uf7+4b0x1HsihBlGNmV5rZQjNbYGaTzOxCM/u7mf3DzN4xs7ah3mgzeyp8q/6nmd2Q7tghbvxZZvZeKHvXzI4N9SaEX66XLrclzrqamtkLZrbEzF4BmqZwV8rHkmVmS0Pcy8zsOTP7kZnNMrPlZtYrPD4M79VsM+sclv3AzHJi1jXTzLrXwviHm9kjof4JZjbHzD42s3vKvT/NzezlsL7nLHIDcDTwvpml7H5ACe7XaDP7Tcwyn4TlmpnZX8Jn9RMzG5Ku/aiImd1a+r9tZg+a2Xth+qywr8VmdmR6o0wid9cjPIAuwDLgyPD8COBw9l7tdRXwhzA9GpgNNCH6uf0GoHEtjP8NYFh4PgJ4NUxPAH4as+yW8DcL+CRM30x0qTFAN2AXkJumfcsK288m+mJTCDwFGNF4Xa8ChwGHhPo/AqaG6WHAQ2H6RGB+LY1/OPBIqD8NGBqmr4l5f/KBTUQ/Im0EfAicHuYVl773tWy/RgO/iVnmk7DcJcATMeUt07Uflexfb+ClMD0DmAs0Bu4A/i021nj/Q3X9oRbEvs4i+jCsB3D3r4n+Ef9qZh8DtxIdhEv9xd13hvpfAW1THXA58eLvA0wO8ycBp1djff2BZ8O6FgILay7UA/KZu3/s7nuARcC7Hv1Hfkz0T9kSeCmc/32Qve/VS8AFZtaYKElOSHXgQVXxx+pDFDfsff9KzXX31WE9RXGWTbXq7Fesj4Gzzex+M+vn7ptSEGt1FQI9zewwYCdRQs4F+hEljHpNCaJqDxN9q8sm+saQGTNvZ8z0bmrZDw+rsIvw/ptZI+B76Q0nIbGv956Y53uIXvu7gffdvStwIeG9cvdtwNtE32gvBZ5LVcDlVBX/gaynNnzuqtqvss9aUPq+LAN6ECWKe8zsd8kPtXrcvQT4jKh1N5soKZwJfB9Ykr7IUkMJYl/vAYPNrDWAmR1B9K20dDyoYekKLEHx4p9NNGQJwOXs/dZTDPQM0xcRNZvL+wD4WVhXV6LTTLVZ7Hs1vNy8J4GxwDx3/yaVQR2gOUSnYGDv+1eVzUCL5IRzUIqJEgFm1gPoGKaPBra5+7PA70vrUPv2YwbwG6L/hxlEp/z+EVpJ9ZoSRAyPhvW4F5huZguAB4jOn75kZoXU8mF9K4j/34FfmNlC4ArgxlD9CeCMUK8PsDXOKh8l6hBdAtxF1Nyuzf4/4P+Z2T8o963a3QuBb4Gn0xHYAbgJuDm8b98n6neoyjjgf2pD5245U4EjzGwRcD1RPxlE/RZzzayI6Jz+PaG8tu3HDKAd8KG7fwnsoAGcXgINtSENRPi2WgCcFM6V12oW/Q5gu7u7mV1G1GGtm2dJSqX73KVI0pnZlUQtq5vrQnIIegKPmJkBG4k610VSSi0IERGJS30QIiISlxKEiIjEpQQhIiJxKUGIiEhcShAiIhLX/w+c5D4LL5Y4pAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from numpy import arange\n",
    "from matplotlib import pyplot\n",
    "\n",
    "colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black\n",
    "\n",
    "def bar_chart(categories, words, counts):\n",
    "    \"Plot a bar chart showing counts for each word by category\"\n",
    "    ind = arange(len(words))\n",
    "    width = 1 / (len(categories) + 1)\n",
    "    bar_groups = []\n",
    "    for c in range(len(categories)):\n",
    "        bars = pyplot.bar(ind+c*width, counts[categories[c]], width,\n",
    "                         color=colors[c % len(colors)])\n",
    "        bar_groups.append(bars)\n",
    "    pyplot.xticks(ind+width, words)\n",
    "    pyplot.legend([b[0] for b in bar_groups], categories, loc='upper left')\n",
    "    pyplot.ylabel('Frequency')\n",
    "    pyplot.title('Frequency of Six Modal Verbs by Genre')\n",
    "    pyplot.show()\n",
    "    \n",
    "genres = ['news', 'religion', 'hobbies', 'government', 'adventure']\n",
    "modals = ['can', 'could', 'may', 'might', 'must', 'will']\n",
    "cfdist = nltk.ConditionalFreqDist(\n",
    "    (genre, word)\n",
    "    for genre in genres\n",
    "    for word in nltk.corpus.brown.words(categories=genre)\n",
    "    if word in modals)\n",
    "\n",
    "counts = {}\n",
    "for genre in genres:\n",
    "    counts[genre] = [cfdist[genre][word] for word in modals]\n",
    "bar_chart(genres, modals, counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、NetworkX\n",
    "NetworkX包定义和操作被称为图的由节点和边组成的结构。（略）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、csv\n",
    "语言分析工作往往涉及数据统计表，包括有关词项的信息、试验研究的参与者名单或从语料库提取的语言特征。这里有一个CSV格式的简单的词典片段：\n",
    "sleep, sli:p, v.i, a condition of body and mind ...\n",
    "walk, wo:k, v.intr, progress by lifting and setting down each foot ...\n",
    "wake, weik, intrans, cease to sleep\n",
    "\n",
    "我们可以使用Python的CSV库读写这种格式存储的文件。例如，我们可以打开一个叫做lexicon.csv的CSV 文件[1]，并遍历它的行[2]：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    ">>> import csv\n",
    ">>> input_file = open(\"lexicon.csv\", \"rb\") \n",
    ">>> for row in csv.reader(input_file): \n",
    "...     print(row)\n",
    "['sleep', 'sli:p', 'v.i', 'a condition of body and mind ...']\n",
    "['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...']\n",
    "['wake', 'weik', 'intrans', 'cease to sleep']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "每一行是一个字符串列表。如果字段包含有数值数据，它们将作为字符串出现，所以都必须使用int()或float()转换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、NumPy\n",
    "NumPy包对Python中的数值处理提供了大量的支持。NumPy有一个多维数组对象，它可以很容易初始化和访问："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4 \n",
      "\n",
      "[[6 7 8]\n",
      " [6 7 8]\n",
      " [6 7 8]] \n",
      "\n",
      "[[7 7 7]\n",
      " [8 8 8]] \n",
      "\n"
     ]
    }
   ],
   "source": [
    "from numpy import array\n",
    "cube = array([ [[0,0,0], [1,1,1], [2,2,2]],\n",
    "               [[3,3,3], [4,4,4], [5,5,5]],\n",
    "               [[6,6,6], [7,7,7], [8,8,8]] ])\n",
    "\n",
    "print(cube[1,1,1],\"\\n\")\n",
    "\n",
    "print(cube[2].transpose(),\"\\n\")\n",
    "\n",
    "print(cube[2,1:],\"\\n\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NumPy包括线性代数函数。在这里我们进行矩阵的奇异值分解，潜在语义分析中使用的操作，它能帮助识别一个文档集合中的隐含概念。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[-0.4472136  -0.89442719]\n",
      " [-0.89442719  0.4472136 ]] \n",
      "\n",
      " [ 6.32455532  3.16227766] \n",
      "\n",
      " [[-0.70710678  0.70710678]\n",
      " [-0.70710678 -0.70710678]]\n"
     ]
    }
   ],
   "source": [
    "from numpy import linalg\n",
    "a=array([[4,0], [3,-5]])\n",
    "u,s,vt = linalg.svd(a)\n",
    "print(u,\"\\n\\n\",s,\"\\n\\n\",vt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NLTK中的聚类包nltk.cluster中广泛使用NumPy数组，支持包括k-means聚类、高斯EM聚类、组平均凝聚聚类以及聚类分析图。有关详细信息，请输入help(nltk.cluster)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4、其他Python库\n",
    "还有许多其他的Python库，你可以使用http://pypi.python.org/ 处的Python包索引找到它们。许多库提供了外部软件接口，例如关系数据库（如mysql-python）和大数据集合（如PyLucene）。许多其他库能访问各种文件格式，如PDF、MSWord 和XML（pypdf, pywin32, xml.etree）、RSS 源（如feedparser）以及电子邮箱（如imaplib, email）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
